getwd()
setwd('~/desktop/tools/r/WhiteWines')

chooseCRANmirror(graphics = FALSE, ind = 1)
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)

White_Wines

Loading Dataset

## 
## The downloaded binary packages are in
##  /var/folders/7v/jrlxtfqx5sb6y5520z15qjsr0000gn/T//RtmpEwEsiB/downloaded_packages

Univariate Analysis:

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

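The output above shows the structure of the white wine dataset: 4898 observations of 13 variables, all numeric except the integer row index X and the integer quality score. Knowing the variable types helps us prepare for visualization.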
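The code that loaded the data is hidden in this report. A minimal sketch of that step might look like the following, assuming the data comes from a CSV file. The temporary file and its three rows (copied from the `str()` output above, with only a subset of the columns) are stand-ins, since the real file name is not given here. The name `ww` matches the data frame name that appears in the correlation output later in the report.

```r
# Hypothetical stand-in for the real data file: three rows and a subset of
# columns copied from the str() output, written to a temporary CSV.
csv_lines <- c(
  "X,fixed.acidity,volatile.acidity,citric.acid,pH,alcohol,quality",
  "1,7,0.27,0.36,3,8.8,6",
  "2,6.3,0.3,0.34,3.3,9.5,6",
  "3,8.1,0.28,0.4,3.26,10.1,6"
)
path <- tempfile(fileext = ".csv")
writeLines(csv_lines, path)

# Read the file and inspect the type of every column
ww <- read.csv(path)
str(ww)
```

With the real file, only the path passed to `read.csv()` would change.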

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

This table shows how many wines received each quality score. Scores range from 3 to 9, and most wines are rated 5, 6, or 7.
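A table like this is what `table()` produces for a column of scores. The sketch below rebuilds a quality vector from the counts printed above (the original row order is lost, but the distribution is the same) and also converts the counts to percentages. The variable names are illustrative, not taken from the hidden code.

```r
# Quality counts copied from the table above
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)

# Rebuild a quality vector: each score repeated as many times as it occurs
quality <- rep(as.integer(names(counts)), times = counts)

# Frequency table of the scores
quality_table <- table(quality)
print(quality_table)

# Share of wines at each score, in percent
quality_share <- round(100 * prop.table(quality_table), 1)
print(quality_share)
```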

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

This summary gives the minimum, quartiles, median, mean, and maximum of every variable, which helps us get familiar with the ranges in the dataset before plotting.
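The quality column of this summary can be checked independently, because the quality counts are printed earlier in the report. The sketch below rebuilds the scores from those counts and calls `summary()` on them; the mean is also computed by hand as a count-weighted average.

```r
# Scores and their counts, copied from the quality table
scores <- 3:9
counts <- c(20, 163, 1457, 2198, 880, 175, 5)

# Rebuild the quality values and summarise them
quality <- rep(scores, times = counts)
quality_summary <- summary(quality)
print(quality_summary)

# The mean as a count-weighted average of the scores
weighted_mean <- sum(scores * counts) / sum(counts)
print(round(weighted_mean, 3))
```

The quartiles, median, and mean agree with the quality column of the summary above.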

This plot shows the distribution of quality. Quality 6 is by far the most common score, with far more wines than any other level.
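The plotting code is not shown in the report. One way to draw a comparable chart is a base R bar plot of the quality counts, sketched below. This uses `barplot()` as a stand-in and is not necessarily the function used for the original figure. The `pdf(NULL)` and `dev.off()` lines only keep the example from opening a window or writing a file; remove them to see the plot interactively.

```r
# Quality counts copied from the table earlier in the report
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)

pdf(NULL)  # null graphics device: nothing is written to disk
bar_midpoints <- barplot(counts,
                         xlab = "Quality score",
                         ylab = "Number of wines",
                         main = "Distribution of quality")
dev.off()

# The tallest bar identifies the most common score
most_common <- names(counts)[which.max(counts)]
print(most_common)
```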

These frequency polygons show the distributions of fixed acidity and pH. The pH values center around 3.15.

Most fixed acidity values fall between 6 and 8, and this range covers most of the wines regardless of quality.

These plots show the distributions of citric acid and volatile acidity. Citric acid centers around 0.3 (mean 0.334 in the summary), whereas volatile acidity centers slightly lower, around 0.25 to 0.3 (mean 0.278).

Density has a mean very close to 1 (0.994 in the summary), and the tallest bin holds around 900 wines.

The sulphates counts rise and then fall between about 0.4 and 0.6. We do not know yet whether sulphates are a good or a bad factor for quality.

In this graph the residual sugar values are heavily concentrated around 0 to 1 on the plotted axis, so most of the wines fall in this range.

Based on the visualization, the distribution of chlorides is roughly normal, but it contains many outliers.
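The presence of outliers can be checked from the summary statistics alone. The sketch below applies the conventional 1.5 × IQR rule to the chlorides quartiles printed in the summary. Treating this rule as the outlier definition is an assumption here; it is close to what `boxplot()` draws by default, though not necessarily identical.

```r
# Chlorides statistics copied from the summary output
q1        <- 0.036
q3        <- 0.050
min_value <- 0.009
max_value <- 0.346

# Fences of the 1.5 * IQR rule
iqr         <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
cat("Fences:", lower_fence, "to", upper_fence, "\n")

# Values beyond the fences count as outliers under this rule
has_low_outliers  <- min_value < lower_fence
has_high_outliers <- max_value > upper_fence
cat("Low outliers:", has_low_outliers, " High outliers:", has_high_outliers, "\n")
```

The maximum of 0.346 lies far above the upper fence, which fits the long tail of outliers seen in the plot.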

The alcohol counts rise and fall at certain values rather than following a smooth curve. Further analysis is needed to identify the regions where the counts increase or decrease.

Here we try to understand the difference between free sulfur dioxide and total sulfur dioxide. Free sulfur dioxide is concentrated in a narrower and lower range (mean 35.31) than total sulfur dioxide (mean 138.4).
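The two sulfur dioxide variables can be compared directly on the rows printed in the `str()` output. The sketch below uses only those ten rows, so it illustrates the comparison rather than reproducing the full analysis. The ratio of the two means is taken from the summary output.

```r
# First ten values of each variable, copied from the str() output
free_so2  <- c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28)
total_so2 <- c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)

# Is free sulfur dioxide below total sulfur dioxide in every row?
all_below <- all(free_so2 < total_so2)
print(all_below)

# Free sulfur dioxide as a fraction of the total, per wine
free_share <- free_so2 / total_so2
print(round(range(free_share), 2))

# The same ratio using the dataset means from the summary output
mean_ratio <- 35.31 / 138.4
print(round(mean_ratio, 3))
```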

Univariate Summary:

In this section we examined each variable on its own. The structure and summary outputs describe what the data contains. Since quality is our variable of interest, its histogram matters most: the most common score is 6, and the mean is 5.878. From the frequency polygons we can see that fixed acidity lies mostly between 6 and 8 and pH between 3 and 3.3. Citric acid has a mean of about 0.3. The histograms of the remaining variables show their ranges, and the box plots show the outliers present in the data.

Bivariate Plot Analysis

## 
##  Pearson's product-moment correlation
## 
## data:  ww$fixed.acidity and ww$quality
## t = -8.005, df = 4896, p-value = 1.48e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.14121974 -0.08592991
## sample estimates:
##        cor 
## -0.1136628
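The numbers in this output can be reproduced from the correlation estimate and the sample size alone. The sketch below recomputes the t statistic, the p-value, and the 95 percent confidence interval (via the Fisher z transformation) from r = -0.1136628 and n = 4898, both taken from the outputs above.

```r
# Values taken from the outputs above
r <- -0.1136628   # sample correlation
n <- 4898         # number of observations
df <- n - 2       # degrees of freedom of the t test

# t statistic and two-sided p-value for H0: true correlation is 0
t_stat  <- r * sqrt(df) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df)

# 95 percent confidence interval via the Fisher z transformation
z          <- atanh(r)
half_width <- qnorm(0.975) / sqrt(n - 3)
ci         <- tanh(c(z - half_width, z + half_width))

print(round(t_stat, 3))
print(signif(p_value, 3))
print(round(ci, 4))
```

The correlation is statistically significant because the sample is large, but at roughly -0.11 it is weak.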

In this box plot we compare fixed acidity across quality levels. Wines of quality 9 show higher fixed acidity than the other levels, although only 5 wines have that score, and the correlation test above finds only a weak negative correlation between fixed acidity and quality (r = -0.114).

This graph shows chlorides by quality level. From these box plots we can see that quality levels 7, 8, and 9 have lower chloride levels than the others.

Using ggpairs to find the relationships between all variables

The ggpairs plot is useful for finding relationships between pairs of variables, since it shows each pairwise plot together with its correlation.
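The correlations that a pairs plot displays can also be computed as a plain matrix. The sketch below uses base R's `cor()` as a stand-in for the `ggpairs()` function from the GGally package, so that it runs without extra packages. It uses only the ten rows printed in the `str()` output and four of the variables, so the values will differ from those of the full dataset. Quality is left out because it is constant in these ten rows.

```r
# First ten rows of four variables, copied from the str() output
ww_head <- data.frame(
  fixed.acidity  = c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  pH             = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

# Pairwise Pearson correlations, rounded for readability
cor_matrix <- round(cor(ww_head), 2)
print(cor_matrix)
```

With GGally installed and the full data frame loaded, `GGally::ggpairs()` would show the same correlations alongside the scatterplots.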

We round pH before exploring further, because the raw pH values take too many distinct values to be used for grouping in the plots.
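A minimal sketch of the rounding step, assuming pH is rounded to one decimal place (the exact precision used in the hidden code is not shown). The values are the first ten pH readings from the `str()` output, and `pH_rounded` is an illustrative name.

```r
# First ten pH values, copied from the str() output
pH <- c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22)

# Round to one decimal place (assumed precision)
pH_rounded <- round(pH, 1)

# Rounding collapses the distinct values into a few groups
cat("Distinct values before:", length(unique(pH)), "\n")
cat("Distinct values after: ", length(unique(pH_rounded)), "\n")

# Number of wines in each rounded pH group
pH_counts <- table(pH_rounded)
print(pH_counts)
```

The rounded variable has few enough levels to be used for faceting or coloring.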

We can explore the relationship between quality and alcohol using box plots. In these plots the median alcohol content rises with quality at the higher quality levels.

This plot shows the relationship between density and alcohol: alcohol content decreases as density increases. Higher density is also associated with lower quality. The fitted linear model shows the decrease in alcohol content.
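The mechanics of such a linear model can be sketched with `lm()`. Only five density values are printed in the `str()` output, so the example below fits the model on those five rows. That is far too few to support any conclusion and serves only to show the direction of the slope and how the fit is obtained.

```r
# First five rows, copied from the str() output (density is shown rounded there)
density <- c(1.001, 0.994, 0.995, 0.996, 0.996)
alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9)

# Fit alcohol as a linear function of density
fit <- lm(alcohol ~ density)
print(coef(fit))

# A negative slope means alcohol falls as density rises
slope <- coef(fit)[["density"]]
cat("Slope is negative:", slope < 0, "\n")
```

On the full data frame the same call would be `lm(alcohol ~ density, data = ww)`.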

Bivariate summary:

We first plotted quality against fixed acidity and quality against chlorides. Acidity levels are low for both good and poor quality wines, so acidity does not appear to change quality. The chloride levels point the same way: high and low quality wines have similar chloride content. From the ggpairs plot we can see the relationships between all the variables; among them, quality and alcohol correlate more strongly than the other pairs. We also rounded pH to the nearest decimal place to create a transformed variable.

Multivariate Analysis

In this plot the x axis is quality and the y axis is alcohol; the points are colored by density, and the data is shown by rounded pH.

In this graph we can see the relationship between alcohol content and density. As density increases, alcohol content decreases, and quality falls with it. Most of the dark blue dots are at the top.

From this scatterplot we can see that good quality wine has approximately a pH of 3 to 3.4, lower density, and an alcohol level of about 10 to 14.

This graph shows residual sugar against density with respect to quality. The concentration of residual sugar decreases as quality increases, but even low quality wines have very little residual sugar, so we cannot draw a conclusion.

From this we can see that both low and high quality wines reach very high pH values with respect to residual sugar.

From this we can see that most of the good quality wine has a very high pH of 3.75 at a total.sulfur.dioxide to density ratio of 100:0.99. This is important in analyzing wine quality.

This scatterplot is meant to help us understand the relationship between alcohol and density with respect to quality. We cannot really find a pattern here, because most of the values are in the quality region 5 to 7, which makes the relationship very difficult to analyze.

In this scatterplot we can see the chloride content plotted against alcohol. This is useful for assessing the role of chlorides and pH relative to alcohol.

In this scatterplot we can see that as quality increases, the citric acid points shift towards alcohol levels of 12 to 13.

This graph shows alcohol against sulphates with respect to quality. As quality increases, the sulphate points shift towards alcohol levels of 12 to 14.

From the analyses above we can say that good quality wine has a pH of 3 to 3.4 with lower density and a good alcohol level, so alcohol content and quality correlate with each other. When density increases, alcohol content decreases, and quality decreases with it. Sulphate and chloride concentrations decrease as alcohol increases. These are some of the factors I observed in the exploratory data analysis process.

Final Plots and Summary

Plot 1:

Plot 1 Analysis

From this scatterplot we can see that good quality wine has approximately a pH of 3 to 3.4, lower density, and an alcohol level of about 10 to 14.

Plot 2:

Plot 2 Analysis

From these box plots we can clearly see how alcohol varies across quality levels. From quality 5 upward, the alcohol content increases roughly linearly with quality. This is a clear indication of a positive relationship between quality and alcohol.

For quality 3 and 4, Plot 1 suggests that the wines might have been affected by density. The wine quality might also have been affected by the concentration of chlorides or sulfur dioxide, as we have seen earlier.

Plot 3:

Plot 3 Analysis

From Plot 2 we can see how quality and alcohol relate to each other. From this scatterplot we can see that chlorides together with alcohol play a vital role in understanding quality. Higher quality wines have a chloride content of 0.0 to 0.05 combined with an alcohol level of 11 to 14. When a wine does not satisfy this relationship, its quality tends to drop, so a wine needs to meet this criterion to reach higher quality.
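The rule of thumb described here can be written as a simple filter. The sketch below flags wines with chlorides of at most 0.05 and alcohol between 11 and 14, using the ten rows printed in the `str()` output. The thresholds come from the paragraph above; treating the range boundaries as inclusive is an assumption, and `in_range` is an illustrative name.

```r
# First ten rows of the two variables, copied from the str() output
ww_head <- data.frame(
  chlorides = c(0.045, 0.049, 0.05, 0.058, 0.058, 0.05, 0.045, 0.045, 0.049, 0.044),
  alcohol   = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

# Flag wines inside the chloride and alcohol ranges (boundaries inclusive)
ww_head$in_range <- with(ww_head,
                         chlorides <= 0.05 & alcohol >= 11 & alcohol <= 14)

# Rows that satisfy the rule
print(ww_head[ww_head$in_range, ])
cat("Wines in range:", sum(ww_head$in_range), "of", nrow(ww_head), "\n")
```

Applied to the full dataset, comparing the quality scores of flagged and unflagged wines would show how well the rule separates them.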

Reflection

For the given dataset it was difficult to find the relationships between variables, and it was quite difficult to understand the chemical properties present in white wines. The most important issue was that there was not much data available for higher quality: wines of quality 8 and 9 are far fewer than the others. Still, I tried my best to understand the data and find exploratory relationships. Since no variable is strongly linearly correlated with quality, performing a linear regression did not seem useful.

Quality is used as the dependent variable and the others as independent variables, and the analysis is based on this approach. We did some analysis based on acidity and pH levels, but none of it produced useful results. Further exploration revealed relationships between density and alcohol and between alcohol and quality. These relationships helped classify the quality of the wine, and the analysis builds on them. Low density combined with good alcohol content and suitable pH and chloride concentrations gives good quality wine.

If more data for higher quality wines becomes available in the future, it will help us identify the key features among the chemical properties. The other missing factor is price. The price of each wine could have broadened the search for quality and given us further relationships, including ones for predicting price.

From the data we can see that chlorides, alcohol content, density, and pH played a crucial role in understanding quality. Since quality is my dependent variable, I hope I did my best in describing the data.